Highly Scaleable Multi-tier Message System
1 Background of the Invention 

This invention pertains to the field of Message Oriented Middleware (MOM). MOM enables computer
programs to exchange discrete messages with each other over a communications network. MOM is
characterized by 'loose coupling' of senders and recipients, in that the sender of a message
need not know details about the identity, location or number of recipients of a message. Furthermore,
when an intermediary message server is employed, message delivery can be assured even when the
ultimate receivers of the message are unavailable at the time at which it is sent. This can be contrasted
with Connection Oriented Middleware, which requires a computer program to have details of the
identity and network location of another computer, in order that it can establish a connection to that
computer before exchanging data with it. To establish a connection, both computers must be available
and responsive during the entire time that the connection is active. Despite the similarities,
MOM is not e-mail. E-mail is a system for moving text messages and attachments to human consumers.
MOM is for moving messages containing arbitrary data between computer programs. An
implementation of an e-mail system could be realized using MOM, however.

This invention pertains specifically to the case where an intermediary message server is employed to
store and distribute messages. Although the senders and receivers are loosely coupled with each other
when communicating via MOM, these parties are normally required to communicate with the server in
a connection-oriented fashion. Thus permitting senders and receivers to communicate without both
being available at the same time requires the server to be available at all times. Furthermore, all clients
who may wish to exchange messages must be connected to the same server, or to different servers which
are capable of working together to achieve the equivalent functionality of a single server (a single logical
server). One of the reasons for employing MOM is to alleviate the requirement of defining a priori which
programs may exchange data with each other. This means that large organizations that use
MOM for computer applications distributed throughout the organization, or organizations that use
MOM to provide service to the general public over the internet, must be ready to accommodate many
thousands of programs communicating through a single logical server. In addition, there may be
demands to deliver messages within a limited amount of time. Securities trading, live online
auctions and chat rooms are examples of potential MOM applications that have restrictions on the
amount of time required to deliver messages. These factors combine to create the need for MOM
servers that can handle large message volumes quickly and reliably.

The following factors dictate the need for a single logical message server that is implemented using the 
combined resources of multiple physical computers in order to meet the needs of the most demanding 
MOM applications: 

• There are inherent limits on the amount of message throughput that can be achieved with a message 
server running on a single computer. 

• The possibility of hardware failure results in the need for redundant computer hardware containing 
identical copies of all critical data at all times. 

• A group of inexpensive computers may be able to provide a required level of functionality more
cost effectively than a single large computer.

In the context of this document, we will define a cluster as a group of computers that work together to
provide a single service with more speed and higher reliability than can be achieved using a single
computer.

A critical measure of the effectiveness of a cluster is scaleability. Scaleability can generally be defined as
the degree to which increased functionality is achieved by employing additional resources. The 
uniqueness of this invention is the way in which it addresses the scaleability issues of message server 
clustering. The specific aspects of scaleability that it addresses are: 

• Scaleability with respect to performance: This is the degree to which adding additional computers to
the cluster can increase the amount of data that can be delivered within a time period, or the speed
at which an individual message can be delivered to its destinations.

• Scaleability with respect to connections: Each active connection to the cluster consumes a certain
amount of resources, placing a limit on the number of connections that can be active at one
time, even if these connections are not used to transfer significant amounts of data. This is
the degree to which adding additional computers to the cluster increases the number of simultaneous
active connections that are possible.

• Scaleability with respect to redundancy: This is the degree to which adding additional computers to
the cluster can increase the redundancy, and therefore the reliability of the cluster, especially with
regard to data storage. If each piece of data is copied onto two different computers, then any one
computer can fail without causing data loss. If each piece of data is copied onto three different
computers, then any two computers can fail without causing data loss, and so on.

• Scaleability with respect to message storage: This is the ability to increase the total storage capacity
of the cluster by adding more machines. A clustering scheme that requires all computers in the
cluster to store all messages cannot scale its storage capacity beyond the storage capacity of the least
capable computer in the cluster.

• Scaleability with respect to message size: This concerns the maximum limit on the size of a single
message. Unlike the other aspects of scaleability, this is not related to the number of computers in
the cluster. Conventional message server solutions cause the maximum message size to be
determined by the amount of working memory (RAM) available in the computers that handle the
message, when other aspects of the implementation do not limit it to be even less than that. This
invention alleviates this restriction and allows maximum message size to be limited only by the
amount of mass storage (hard disk capacity) available on each computer.



Conventional messaging cluster implementations are extensions of servers architected to run on a single 
computer. Each computer in the cluster is a complete server, with extensions that allow it to work 
together with other servers in the cluster. In order to ensure that all messages are available to all
potential receivers, all servers in the cluster must share information about the existence of messages 
and/or the existence of receivers with all other servers in the cluster. The current state of the art in 
reliable network communications is unicast (point-to-point) network connections. The use of unicast to 
exchange data between all possible pairs of computers in the cluster results in inefficient usage of the 
communications network that severely limits scaleability. In a cluster of N servers, each piece of 
information that a server must share with all other servers in the cluster must be sent N-1 times across
the same communication network. This means that adding additional servers to the cluster causes more 
communications network capacity to be used, even when the actual data rate does not change. This does 
not scale well, since adding large numbers of servers to a cluster will cause the communication network 
to become saturated, even with small numbers of senders and receivers, and low message volumes. 
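
The growth described above can be made concrete with a small calculation. The following is only an illustrative sketch: the class and method names are hypothetical, and it assumes the simplest case in which one item of shared information is sent to every other server by unicast.

```java
// Illustrative sketch only; the names are hypothetical and not part of the invention.
final class ReplicationCost {
    // Unicast: one server must send the same item once to each of the other servers.
    static long unicastSendsPerItem(int servers) {
        return servers - 1L;
    }

    // Total sends when every one of the servers shares one item with all others.
    static long unicastSendsIfAllShare(int servers) {
        return (long) servers * (servers - 1L);
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 10, 50}) {
            System.out.println(n + " servers: " + unicastSendsPerItem(n)
                + " sends per item, " + unicastSendsIfAllShare(n)
                + " sends if every server shares one item");
        }
    }
}
```

The second figure grows with the square of the cluster size, which is why network load rises as servers are added even when the client data rate stays the same.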



In contrast, the invention described here assigns different functions to different computers in the cluster. 
The programs running on each individual computer cannot, and need not, operate as a complete server. 
This actually eliminates the need for all computers in the cluster to communicate with all other 
computers in the cluster. Additionally, a reliable multicast (point to multipoint) protocol is employed to
further reduce the need for identical data to be sent multiple times across the same communications
network.



This invention is specifically designed to accommodate programs that send and receive messages using 
the Java Message Service (JMS) application programming interface published by Sun Microsystems 
Inc. The definition of this interface is available at http://java.sun.com/products/jms/docs.html.



2 Brief Summary of the Invention 

The invention described in this document is a highly scaleable clustered message server (the cluster).
The invention uses a unique cluster design to achieve a higher degree of scaleability than has been
previously possible with this type of server. The cluster is designed to scale well with respect to number
of connections, message volume, and reliability. This means that the capacity of the cluster in each of
these areas will increase as more machines are added to the cluster. In addition, it is designed to be
scaleable with respect to message size, in that it will not fail to operate with messages of arbitrarily
large size.



The structure of a complete message system is shown in Drawing 1: Message System Diagram. The
cluster is made up of components called nodes. Each node is a program that manages one
part of the cluster. There are two types of node: Message Manager (MM) and Client Manager (CM). A
cluster consists of one or more CM's and one or more MM's. The CM's are responsible for managing
client connections. The MM's are responsible for storing messages.



The invention is comprised of the computer software that performs the functions of the CM and the 
MM as well as the part of the software in the client that implements the JMS compatible messaging
library. The following things are not part of the invention and can vary across different deployments of
the invention: 

• The computer hardware on which the software runs

• The number of computers used in the cluster and allocation of CM nodes, primary MM nodes and 
backup MM nodes among those computers. 

• The type and configuration of the network that interconnects the nodes in the cluster 

• The type and configuration of the network(s) that connect clients to CM nodes

• The client application that interacts with the JMS compatible message library 

• The means for determining to which CM a client should connect in order to balance the load among
all CM nodes. (There is a large variety of existing hardware and software solutions for this which 
are appropriate.) 



In order to connect to the cluster, a client must connect to one of the CM's. All of the CM's in a cluster are
interchangeable. A client will get the exact same service from the cluster, regardless of which CM it
connects to. The CM is responsible for managing client connections, client authentication, access
control, forwarding messages from producer clients to the MM and forwarding messages from the MM
to a consuming client. As stated above, all of the CMs are interchangeable, and additional CMs can be 
added to increase the total number of clients that can be served by the cluster. If a CM fails, the clients 
that were previously connected to that CM may reconnect to another CM and continue functioning 
without any loss of service. 



Messages are stored in a destination until they are consumed. The destination can be a queue or a topic, 
depending on the actual service desired. These terms are defined in the JMS specification. Each 
destination exists on one or more MM's. When a destination exists on more than one MM, one of them 
is designated as the primary and is responsible for providing all of the services of the destination. All 
other MM's containing that destination are backups, which maintain the same state as the primary, but
do not provide any services unless the primary fails to function. Increasing the number of MM's
increases the capacity of the cluster to store messages and increases the number of destinations that can
be accommodated. Increasing the number of MM's also permits an increase in the number of backup
MM's, which decreases the likelihood of losing data if multiple nodes fail simultaneously.



In order to assure that all clients can send messages to, and receive from, all destinations, it is necessary
that all CM's can communicate with all MM's, and vice versa. It is not necessary for CM's to directly
communicate with other CM's. It is not necessary for MM's to communicate with each other directly,
except for communication between primaries and their corresponding backups. This reduces the
number of connections that must be maintained between nodes by half, compared to traditional cluster
designs that require all nodes to be connected to each other. As discussed below, the use of multicast
communication removes the need for point to point connections between nodes entirely. Despite this,
the fact that not all pairs of nodes require direct communications still provides benefit because it allows
a lot of freedom in creating partitioned network topologies that prevent network communication from
becoming the bottleneck that limits the performance of the cluster. (See Drawing 2: Alternate Network
Topologies.)
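
The reduction in node-to-node links can be checked with a short calculation. This is a sketch under stated assumptions: the class and method names are hypothetical, primary-to-backup links are left out of the count, and the result comes out near one half only when the cluster has about as many CM's as MM's.

```java
// Illustrative sketch only; the names are hypothetical and not part of the invention.
final class ClusterLinks {
    // Links needed when every node must be connected to every other node.
    static long fullMesh(int nodes) {
        return (long) nodes * (nodes - 1) / 2;
    }

    // Links needed when only CM-to-MM pairs communicate.
    // Assumption: links between primary and backup MM's are not counted.
    static long cmToMmOnly(int cms, int mms) {
        return (long) cms * mms;
    }

    public static void main(String[] args) {
        int cms = 10;
        int mms = 10;
        System.out.println("full mesh:  " + fullMesh(cms + mms));  // 190
        System.out.println("CM-MM only: " + cmToMmOnly(cms, mms)); // 100
    }
}
```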

The transfer of data between CM's and MM's is achieved using a reliable multicast protocol. Multicast
protocols differ from unicast (point to point) protocols in that they enable one
piece of data to be distributed to multiple machines across a network without having to send that same
data over the same network multiple times. They differ from broadcast protocols in that they do not
require the data to be distributed to all computers on the local network. Multicast is the most efficient
means of distributing identical data to a limited number of computers on the same local area network.
The preferred embodiment of this invention uses the reliable multicast communication protocol 
provided by the product iBus//MessageBus from Softwired. 

Since data is distributed via multicast, the primary and backup MM's can receive the same data without
incurring significantly more network traffic than there would be if no backups were present. This means
that the cluster can have as many backups as desired, resulting in no limit on the scaleability of storage
redundancy. The cluster does not, however, require that all machines store all messages, which would
limit the scaleability of cluster storage capacity.

The unique aspect of this invention is its ability to provide the function of a single logical message server,
while providing a high degree of scaleability in all of the following respects: 

• Scaleability with respect to performance: Load balancing permits performance to scale as the
number of nodes is increased. Different clients that connect to different CM's and exchange
messages over different destinations need not access the same nodes at the same time, thus all
operations done by the cluster on behalf of these clients may execute in parallel. Limits are imposed
when many clients compete for resources of the same CM or the same MM (too much load on one
destination), as well as by the data network that interconnects the cluster. When the cluster is
deployed with client applications that distribute load evenly over many destinations, client
connection logic that distributes clients evenly over CM's, and network topologies that permit
maximal parallel data transfer between CM's and MM's, then there is no fixed limit on performance.

• Scaleability with respect to connections: The number of connections that may be maintained scales 
linearly with the number of CM's. This means that if each CM can handle n connections, then m
CM's can handle m × n connections. The number of CM nodes may be increased independently of
the number of MM nodes. 

• Scaleability with respect to redundancy: The use of multicast data communication allows backup
nodes to maintain data synchronization with their primary node without adding load to the primary or
consuming additional network bandwidth. This means that a cluster may be deployed with as many 
redundant backups as desired, without a significant impact on cluster performance. 

• Scaleability with respect to message storage: On a single node, message storage is limited by the 
amount of mass storage (hard disk space) that can be attached to that node, as well as the speed at 
which data can be transferred to and from that mass storage. This cluster design does not require all 
MM nodes to store all data. Each primary MM stores different data, and the total amount of storage 
capacity scales linearly with the number of MM nodes, assuming all MM nodes have the same 
storage capacity and the client application is effective in distributing load evenly across destinations. 

• Scaleability with respect to message size: Message size is unrelated to the number of nodes in the 
cluster, but avoiding a fixed limit on the maximum size is also an important scalability issue. This 
cluster design allows clients to send messages that are located only in mass storage. The message is 
read from mass storage in chunks, with each chunk being sent to a CM and forwarded to an MM 
where it is placed back into mass storage. The first chunks of the message may be written to mass 
storage in the MM before the last ones are read from mass storage in the client. Transfer of 
messages from a MM to a consuming client happens in the same fashion. The result of this is that no 
message will cause capacity limits to be exceeded, and messages that are extremely large will
not degrade performance for other messages that are transferred at the same time. 
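
The chunked transfer described in the last item above can be sketched as follows. This is a minimal illustration and not the actual implementation: the class and method names are hypothetical, the chunk size is an arbitrary choice, and in-memory streams stand in for the file and network streams that a real client, CM and MM would use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Illustrative sketch only; the names are hypothetical and not part of the invention.
final class ChunkedTransfer {
    // Copies the message one chunk at a time, so working memory use is bounded
    // by chunkSize no matter how large the message is.
    static long transfer(InputStream source, OutputStream sink, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = source.read(chunk)) != -1) {
            sink.write(chunk, 0, read); // earlier chunks are stored before later ones are read
            total += read;
        }
        sink.flush();
        return total;
    }

    // Demo helper: in-memory streams stand in for mass storage on both sides.
    static byte[] roundTrip(byte[] message, int chunkSize) {
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        try {
            transfer(new ByteArrayInputStream(message), stored, chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stored.toByteArray();
    }

    public static void main(String[] args) {
        byte[] message = new byte[1_000_000];
        byte[] stored = roundTrip(message, 64 * 1024);
        System.out.println("stored " + stored.length + " bytes");
    }
}
```

Because only one chunk is held in memory at a time, the limit on message size moves from RAM to the capacity of the storage behind the streams.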

An additional important feature of this invention is that it does not possess a single point of failure. The
failure of any single function in the cluster will not cause the entire system to become inoperable.
Many systems that provide some form of fault tolerance still depend on some system
aspect whose failure will render the entire system unusable.

3 Brief Description of the Drawings 

Drawing 1: Message System Diagram 

This shows a typical message system configuration with multiple instances of each type of node: CM,
MM primary, MM backup.

Drawing 2: Alternate Network Topologies 

This shows the message system of Drawing 1 with 2 examples of more complex network structures
used to interconnect the nodes of the cluster. This allows increased network speed. Such network 
structures are only possible because not all nodes are required to communicate directly with each other. 

Drawing 3: Client Manager Architecture 

This shows the internal detail of a Client Manager (CM) node. 

Drawing 4: Message Manager Architecture 

This shows the internal detail of a Message Manager (MM) node. 

4 Detailed Description of the Invention 

5 Concurrency Issues 

Compared to the current message server, the clustered server requires a higher degree of concurrency. 
As explained below, this gives rise to a high-level server architecture that is significantly different from
that which is implemented in the present server. Spreading the functionality of the message server over
multiple machines gives rise to a number of situations in which the flow of control in one session may 
block for a relatively long period of time, while other sessions could continue to execute if enough 
threads of control are available. Examples of these scenarios are: 

• 2 Phase Commit: Committing a transacted session that is accessing data from multiple MM's 
requires a 2 phase commit protocol (internal to the cluster). This can take a long time to 
complete, as it requires several round trips of communication between the transaction manager 
and the transaction resources. Since the scope of a transaction is limited to one session, other 
sessions should be able to execute uninterrupted during this time. 

• Uneven Load: Despite load balancing efforts, there will be times when individual machines in
the cluster will be more heavily loaded than others. Sessions that are accessing data stored
exclusively on lightly loaded machines should not be blocked by sessions that are accessing
overloaded machines.

• Very Large Messages: Support for very large messages also gives rise to situations where one
session may need to wait for a very long period of time while bulk data is being transferred.
Other sessions should be able to send and receive smaller messages during this time.

Failure to allow for the proper level of concurrency can cause the entire cluster to exhibit performance 
degradation due to one overloaded machine or one demanding client, even though enough resources 
(CPU, memory, bandwidth) would actually be available. 

Distributing the client connections over many CM processes provides one level of concurrency. As we
anticipate typical deployments to have tens of thousands of clients, and only tens of CM's in a cluster,
this is not enough. We need many threads within each CM. Indeed, according to the JMS specification,
one of the reasons that a single client may create multiple sessions is to achieve concurrency, thus it is
essential that the CM be multithreaded at the session level and not at the connection level. On the
CM, each session must effectively have its own thread in order to fulfill the requirements described
above. Since there may be thousands of sessions on each CM, it is not practical to give each session
a dedicated thread. Instead, thread pooling must be used.

These arguments apply to the MM as well, except that the unit of concurrency is the destination. Each
destination must maintain a well-defined message order, which precludes concurrent handling of
commands for one destination. The actions of sessions that interact with a common destination will
become at least partially serialized, but sessions that do not access common destinations should be able 
to interleave their operation without restriction. 



6 The Node Architecture 

Drawings 3 and 4 show block diagrams of the CM and MM nodes, respectively. They depict high
level modules and the flow of control. Much of the node specific functionality is encapsulated in the
Session Task of the CM and the Destination Task of the MM. The architecture at the level shown in the
diagrams is very similar for both node types, and they share many common components. As such, they 
will be described together, with specific differences noted when appropriate. 

7 Client I/O and Cluster I/O Modules 

These modules decouple the server core from the communications infrastructure. The I/O subsystems 
serve to hide communication functionality from the core server and help divide functionality more 
cleanly into separate modules. The specific responsibilities of the I/O subsystems are: 

• Hiding Connection/Channel details: The functionality of the CM core revolves around the
session object. JMS inherently groups together sessions by connection, but connections are a
concept of remote communication only. Thus the client I/O subsystem can completely hide the
connection details from the rest of the server. It takes on full responsibility for opening and
closing connections, as well as storing all state and properties associated with the connection
itself. It must, as well, provide a means to map session ID's to connections so that the session
objects can communicate with their corresponding clients without the need to maintain
connection information themselves. Likewise the Cluster I/O hides all details of the channels
(iBus topics) used to communicate with the MM's and provides a mapping from destination ID
to channel.

• Authentication: This is the act of verifying the identity of the client using name and password
or a digital certificate. This is primarily relevant for Client I/O, but could be extended to
Cluster I/O if there is a requirement to ensure the identity of nodes joining the cluster. (This
level of control is expected to be provided by employing firewalls or otherwise isolating the
cluster network.)

• Connection Access Control: (Client I/O only) Client I/O will reject connections from clients
who are not authorized to access the message server.

• Command Routing: The I/O modules are responsible for two aspects of command routing.
For each inbound command they must identify the command type and route it to the appropriate
dispatcher. For each outbound command they must identify the type and the session or
destination ID, and use these to determine the channel or connection over which to send the
command.

8 The Core 

The core is the most central part of the node. It contains the command dispatchers, command queues 
and command handler tasks. It is the bridge between the single threaded world of command dispatching 
and the multithreaded world of the task objects that handle commands. As stated above, the I/O 
modules are responsible for routing commands based on their type. For each command type, there is a 
command dispatcher. Many of these command dispatchers are very simple and do nothing more than
take each command and enqueue it into a thread safe queue. The Session/Destination Command
Dispatcher is a bit more complex. It dispatches to many session tasks, so it must examine the session ID
contained in the command, and place the command in the correct queue. The Inter-task Dispatcher is
similar to the Session Command Dispatcher, but adds the aspect that commands are submitted to the
dispatcher via multiple thread safe queues. It allows the various tasks to send notifications to each other
without requiring excessive synchronization or creating race conditions.

The thread safe queues form the bridge to the pool of threads, which executes the collection of tasks.
Each queue is configured with a 'high water mark'. This is the maximum number of commands that are
allowed to accumulate in a queue before flow control will be engaged for that session or destination.
See the section on flow control below for more information.
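
A dispatcher of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: all class, method and field names are hypothetical, the command payload is a placeholder string, and flow control is assumed to be signalled as soon as a queue reaches its high water mark.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only; the names are hypothetical and not part of the invention.
final class SessionCommandDispatcher {
    /** A command addressed to one session; the payload type is a placeholder. */
    static final class Command {
        final long sessionId;
        final String payload;

        Command(long sessionId, String payload) {
            this.sessionId = sessionId;
            this.payload = payload;
        }
    }

    // One thread safe queue per session task, keyed by session ID.
    private final Map<Long, BlockingQueue<Command>> queues = new ConcurrentHashMap<>();
    private final int highWaterMark;

    SessionCommandDispatcher(int highWaterMark) {
        this.highWaterMark = highWaterMark;
    }

    /** Creates the queue for a newly registered session task. */
    void register(long sessionId) {
        queues.putIfAbsent(sessionId, new LinkedBlockingQueue<>());
    }

    /**
     * Places the command in the queue of its session. Returns true when the
     * queue has reached the high water mark, meaning the caller should engage
     * flow control for that session (assumed policy).
     */
    boolean dispatch(Command command) {
        BlockingQueue<Command> queue = queues.get(command.sessionId);
        if (queue == null) {
            throw new IllegalArgumentException("unknown session " + command.sessionId);
        }
        queue.add(command);
        return queue.size() >= highWaterMark;
    }

    /** Called by the session task; returns null when no command is waiting. */
    Command poll(long sessionId) {
        BlockingQueue<Command> queue = queues.get(sessionId);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        SessionCommandDispatcher dispatcher = new SessionCommandDispatcher(2);
        dispatcher.register(1L);
        System.out.println(dispatcher.dispatch(new Command(1L, "publish A")));
        System.out.println(dispatcher.dispatch(new Command(1L, "publish B")));
    }
}
```

The dispatcher itself stays single threaded; only the queues are shared with the worker threads, which keeps synchronization confined to the queue implementation.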

The task collection consists primarily of a multitude of session tasks. In addition, each CM will have
exactly one task responsible for each of: Session Management, Destination (Proxy) Management,
Reliability Management, Configuration Data Distribution and Administration. Each MM will have
exactly one task responsible for each of: Destination Management, Reliability
Management, Configuration Data Distribution and Administration. All of these tasks are registered with
the Thread Pool Manager, which will distribute a fixed number of threads among all of the tasks that
have commands waiting to be handled. All tasks must implement the interface necessary to be run by
the thread pool, but they need not be aware of the thread pool itself.

• Session Management (CM): Creating new session tasks and registering them with the Session 
Command Dispatcher and the Thread Pool. 

• Destination Management (MM): Creating new destination tasks and registering them with 
the Destination Command Dispatcher and the Thread Pool. 

• Destination Management (CM): The Destination Service of the CM maintains information 
about the destinations that that particular CM interacts with. The Destination Manager task 
processes destination commands that arrive from the cluster and uses them to keep the
Destination Service up to date. Destination commands include creation and destruction of 
destinations, plus flow control status. 

• Session Task (CM): This encapsulates the functions of a JMS Session: Managing consumers
and producers, publishing and consuming messages, managing transactions, access control, 
etc. 

• Destination task (MM): This encapsulates the functionality of a JMS Destination: storing 
and distributing messages, managing consumers and their message selectors, committing 
transactions, etc. 

• Admin Manager: The Admin Manager is the central coordination point for administration of
the various modules and services in a node. Each module that requires administration can 
register a handler with the Admin Manager. In the CM, the session command dispatcher 
dispatches admin commands, because these commands are routed to the CM through an 
ordinary topic with a reserved name and ID. (See the section on Administration below.) In the 
MM, admin commands have a separate dispatcher, as the MM does not otherwise subscribe to 
topics in other MM's.

• Config Distributer Task: This task listens for requests for configuration data from new
nodes. It is a critical part of the system that ensures that all nodes use consistent configuration
data. A newly started node will request confirmation that its configuration data is consistent
with that of the nodes already running in the cluster. The Config Distributer Task of each running
node will confirm or deny this. If the new node determines that its config data is not consistent, it will
request the config data from one existing node. The Config Distributer Task from that node is
responsible for providing this data.

• Reliability Manager Task: This task is responsible for monitoring view change events (nodes
or clients appearing or disappearing) delivered to the node by the I/O subsystems. It must take
appropriate action if necessary. Typical action in the CM will be to close all consumers that
listen to destinations that no longer exist. Typical action in the MM is to close consumers that
belong to sessions on CM's that are no longer reachable. In a backup MM the Reliability
Manager Task manages the fail-over process when the primary MM fails.



9 Other services 



10 Destination Service (CM Only) 

The Destination Service provides essential information about the destinations with which a CM 
interacts. It is responsible for: 

• creating/locating destinations of messages that are being published, in the case of destinations
that are previously unknown to the CM

• maintaining a list of known destinations with corresponding names, IDs, flow control status
and access control lists

• maintaining a mapping between destinations and sessions that have producers for those
destinations or have been publishing to them in the past. This information is essential to the
forwarding of flow control messages.

11 Thread Pool Manager

The Thread Pool Manager maintains a list of tasks that are to be run in different threads. It maintains a
collection of threads that may be smaller than the total number of tasks. It is able to detect if each task
needs to run and it will distribute the available threads among these tasks, ensuring that each task runs in
only one thread at a time.
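
One common way to obtain this behaviour is sketched below. It is a hedged illustration rather than the Thread Pool Manager itself: the class name and its methods are hypothetical, a standard java.util.concurrent fixed thread pool stands in for the manager, and each task uses an atomic flag so that it is scheduled on at most one pool thread at a time.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; the names are hypothetical and not part of the invention.
final class PooledTask implements Runnable {
    private final Queue<Runnable> commands = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final ExecutorService pool;

    PooledTask(ExecutorService pool) {
        this.pool = pool;
    }

    /** Queues a command and makes sure the task is scheduled to run. */
    void submit(Runnable command) {
        commands.add(command);
        schedule();
    }

    // Only the caller that flips the flag hands the task to the pool,
    // so the task is never running in two threads at once.
    private void schedule() {
        if (scheduled.compareAndSet(false, true)) {
            pool.execute(this);
        }
    }

    @Override
    public void run() {
        try {
            Runnable command;
            while ((command = commands.poll()) != null) {
                command.run();
            }
        } finally {
            scheduled.set(false);
            // A command may have arrived after the last poll; reschedule if so.
            if (!commands.isEmpty()) {
                schedule();
            }
        }
    }

    /** Runs the given number of commands on one task and returns the peak concurrency observed. */
    static int runDemo(int threads, int commandCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        PooledTask task = new PooledTask(pool);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(commandCount);
        for (int i = 0; i < commandCount; i++) {
            task.submit(() -> {
                peak.accumulateAndGet(active.incrementAndGet(), Math::max);
                active.decrementAndGet();
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency within one task: " + runDemo(4, 1000));
    }
}
```

Many such tasks can share the same pool. Tasks with empty queues hold no thread, which is what allows thousands of session or destination tasks to be served by a small, fixed number of threads.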

12 Globally Unique ID Generation

Several classes of object will be required to have ID's that are globally unique throughout the cluster.
These classes include messages, sessions, destinations, consumers, and nodes. A unique ID can be 
generated locally by each node using a combination of the following values: 

• IP Address (perhaps limited to subnet address): All computers that support the Internet
Protocol (IP) for network communications have an IP address that is guaranteed to be unique
on the local network. If a computer is directly connected to the public internet, this address is
guaranteed to be unique worldwide. Computers in a messaging cluster will often be on an
isolated network, which may use non-unique IP addresses (usually in the address block
192.168.xxx.xxx). In this case a configured site ID is required to insure that messages routed
to other message servers on different isolated networks always have a unique message ID.

• Site ID: In the case that non-unique (internal) IP addresses are used, the ID can be made
globally unique by adding a configured site ID.

• Port Number: All computers that support the Internet Protocol (IP) for network
communications support the concept of ports. A port specifies one of many possible specific
destinations for data delivered to a computer over an IP network. When an application requests
to listen on an IP port it will always be assigned a port number that is unique on that
computer. This insures that two nodes running on the same computer will generate
non-overlapping sets of ID's.

• Locally generated sequence number: The values above will identify a node uniquely. To 
identify the individual sessions, consumers, and messages, a sequence generator will be 
maintained for each of these. A sequence generator may start with zero and must be 
incremented each time an ID is assigned. 

• Start Time: When a node is shut down and restarted, the sequence generators may be reset to
zero. By adding the time that the node started operating, there is no chance of an ID being reused.

These values should be stored in a data structure that is compact and efficient to use for comparisons
and hash code generation. One or more long integers or an array of bytes are ideal choices. The structure
must allow enough storage capacity for compact representations of all of the values above, including
enough capacity for the sequence numbers of all of the ID's that may be generated between restarts of a
node. (Alternately, the Start Time may be updated if the sequence generator overflows.)

Only cluster nodes should generate unique ID's. It is difficult to insure that a client would generate truly
unique ID's using the method described above (especially in the case of potential non-IP clients that
connect via IrDA, SMS, WAP or other protocols). Clients should obtain unique ID's from the server to
which they are connected.
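
One possible compact layout for such an ID is sketched below in Python. The field widths (4 bytes for an IPv4 address, 2 for the site ID, 2 for the port, 8 for the start time in milliseconds, 8 for the sequence number) and the class name IdGenerator are assumptions made for illustration only; the document does not prescribe a layout.

```python
import ipaddress
import struct
import threading
import time


class IdGenerator:
    """Builds 24-byte IDs: IP, site ID, port, start time, then sequence number."""

    def __init__(self, ip, site_id, port, start_time_ms=None):
        if start_time_ms is None:
            start_time_ms = int(time.time() * 1000)
        # Big-endian, no padding: 4 + 2 + 2 + 8 = 16 bytes identifying the node.
        self._prefix = struct.pack(
            ">IHHQ", int(ipaddress.IPv4Address(ip)), site_id, port, start_time_ms
        )
        self._next_seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        # The sequence number starts at zero and grows by one per ID.
        with self._lock:
            seq = self._next_seq
            self._next_seq += 1
        return self._prefix + struct.pack(">Q", seq)
```

A bytes value is hashable and compares cheaply, which matches the stated need for efficient comparison and hash code generation. In practice one generator would be kept per class of object (sessions, consumers, messages), as described above.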



13 Very Large Message Handling (VLMH)

A Very Large Message (VLM) is one that is too big to fit into RAM, or at least too big to be handled
efficiently in one piece. Smaller messages can be embedded directly into a single publish command.
For VLMs it would be desirable to transfer the message file to file using ftp or a similar
protocol. This would not be sufficient, however. Firewall restrictions may block the additional protocol,
even though the JMS connection is permitted (or the JMS connection is tunneled through http). This
could also lead to a proliferation of connections. Lastly, data transfer between CM and MM must be
multicast to achieve high availability using the method described below.

Very large messages must be sent from the client to the CM over the same connection that is used
for session commands. The VLM must be multiplexed with other connection data so that it does not
block other traffic on that connection. This is done by fragmenting the VLM and
sending each piece as a separate command. While small messages can be sent in a single command,
VLMs will be sent as a chain of commands, each carrying the next part of the message. The CM will
need to send these fragments to the MM in the same way over an iBus multicast channel. It must begin
sending to the MM before the last one is received from the client, as it cannot assume that the message
will fit in memory. The CM can also employ a disk buffer to insure that the client session is freed as
soon as possible.

Consumption of messages works in a similar fashion, with the MM sending the message to the CM in
fragments, and the CM forwarding the fragments to the client.
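
The chaining of fragments can be sketched in Python as follows. The command fields (message_id, index, last, payload) and the function names are hypothetical; the sketch only shows that a message can be read from a stream, sent as a chain of commands and written to a sink without ever holding the whole message in memory.

```python
def fragment_stream(message_id, stream, fragment_size):
    """Yield one command per fragment; only two fragments are in memory at once."""
    index = 0
    chunk = stream.read(fragment_size)
    while True:
        # Read ahead so that the final fragment can be flagged as such.
        next_chunk = stream.read(fragment_size)
        last = not next_chunk
        yield {"message_id": message_id, "index": index,
               "last": last, "payload": chunk}
        if last:
            return
        chunk = next_chunk
        index += 1


def reassemble(commands, sink):
    """Write fragments to a sink (file or buffer) in order, checking the chain."""
    expected = 0
    for command in commands:
        if command["index"] != expected:
            raise ValueError("fragment out of order")
        sink.write(command["payload"])
        expected += 1
        if command["last"]:
            return
    raise ValueError("chain ended without a final fragment")
```

Since fragment_stream is a generator, a forwarding node can pass each command on as soon as it is produced, which corresponds to the CM beginning to send to the MM before the last fragment has arrived from the client.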

It is important to note that VLMs, as they are defined here, cannot be sent or received by a JMS client
using the standard API, which implicitly assumes that messages can be passed as single objects. The
use of VLMs would require a non-standard client method, which passes or receives I/O streams, or a
non-standard message type, which can embed a handle to the stream in the message object.

14 Flow Control

There are two levels of flow control: transport level and application level. Transport level flow control
is provided by the communications protocols in use. In the case of the cluster these are TCP and
iBus//MessageBus. It is undesirable to rely on transport level flow control, since there will be a
variety of commands multiplexed over each connection or channel. One slow receiver would cause
transport level flow control to block all traffic on the shared connection/channel. Also, when transport
level flow control is triggered, then there is data stored in internal buffers that is no longer accessible to
the sender and not yet available to the receiver. It is undesirable for this data to remain 'in transit' for an
extended period of time until flow is resumed.

It is more desirable to rely on application level flow control. Since this form of flow control is part of 
the application it can propagate flow control signals all the way to the source of commands (usually a 
client or destination) before flow is actually stopped. If these signals are propagated early enough, it is
possible that commands that are stored in intermediate buffers can be processed before transport level 
flow control is engaged. 

Application level flow control also allows the application to have more specific knowledge of the flow 
state. A queue that knows that a particular consumer is blocked can choose to distribute messages to
other consumers instead of blocking or allowing the consume command to sit in intermediate buffers for an
indefinite period of time.

• Destinations need to know when the sessions of their consumers are blocked. Queues can use this
in their distribution decisions. Topics can decide to stop distributing until all consumers are
unblocked.

• It would be helpful for producers to be able to know when a destination is blocked. They can
then publish to other destinations or do other tasks instead of blocking. The JMS API does not
support this, but features could be added; for example: isBlocked(Destination), trySend()/
tryPublish(), or an IsBlocked exception for the existing send() and publish() calls.

• Flow control should be propagated proactively and asynchronously, so that intermediate
queues have a chance to flush before downstream blockage occurs.

• If proactive flow control propagation works as desired, CM sessions do not need to explicitly
deal with flow control. In reality, transport level flow control can still occur. The CM session
writes data to one client session and multiple destinations. If one of these is blocked at the
transport level, the session should not necessarily block. The session should process
commands in order, but the destinations and clients operate asynchronously (relative to each
other), so there is no absolute ordering of commands from the two sources: client and cluster.
Unlike other tasks, the Session Task should have multiple input queues, one from the client,
one from the cluster, and possibly a separate one for flow control commands from destinations.
It can peek at the first command in each queue, and select the one that has the highest
likelihood of succeeding based on the flow control state that it knows about.

• The CM session can also be the originator of flow control commands, in the event that
commands are building up in its input queues faster than it can process them.
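
The queue selection performed by the Session Task might look like the following Python sketch. The queue names, the command layout and the use of a set of blocked targets are assumptions for illustration; "highest likelihood of succeeding" is simplified here to "the target of the head command is not known to be blocked".

```python
from collections import deque


def select_next_command(queues, blocked_targets):
    """Pick the head command of the first queue whose target is not blocked.

    queues: dict of queue name -> deque of commands, listed in priority order.
    Each command is a dict with a "target" key (None if it has no target).
    Returns (queue_name, command), or None if every head command is blocked.
    """
    for name, queue in queues.items():
        if queue and queue[0]["target"] not in blocked_targets:
            return name, queue.popleft()
    return None
```

Only the head of each queue is examined, so the order of commands within one source is preserved, while a blocked destination no longer prevents the session from delivering messages that arrive from the cluster.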

Application level flow control is implemented using additional commands that travel over the same
routes as the data that they control. This makes it essential that these commands are sent proactively,
i.e. early enough to reach their destination before low level flow control is invoked or system resources
are exceeded.

The table below lists the application elements that can issue flow control commands, and to where those 
commands need to be routed. 

Who is blocked | Who needs to know | Route of the flow control command | Where the blocked state is recorded
Client session consumer | CM session; destinations of all consumers in that client session | Client -> CM Session -> CM Destination Service -> All relevant Destinations | Flag in CM session; consumer table in destination
Destination | CM sessions that are likely to publish to this destination; client sessions that are likely to publish to this destination | Destination -> CM destination manager -> CM Destination Service -> All relevant CM Sessions -> All relevant clients | Lookup table in Destination Service of CM; lookup table in client
CM Session, client input | Client session | CM Session -> client | Flag in client session
CM Session, destination input | Destinations of all consumers in that session | CM Session -> CM Destination Service -> All relevant Destinations | Consumer table in destination



15 High Availability

High availability is an issue primarily in the Message Manager, as this is the only part of the system
that is responsible for storing messages persistently. The Client Manager does not store critical state
information, so the failure of a CM is relatively easy to deal with. The fail-over procedure for a CM will
be discussed first. All subsequent discussion will concern the MM.

16 High Availability of the CM 

The CM stores only transient state information. Unlike the messages stored in the MM's, none of this
state is expected to survive a node restart. For this reason it is not necessary to maintain redundant
copies of this state on other CM's. If a CM fails, all clients connected to it will immediately detect that
the connection is broken. The client library will automatically reconnect, and the connection balancing
logic will reconnect it to any other CM that is still operating. After connection, the client must
recreate each session. Parameters in the command to create a session can indicate that this is a session
that ran previously on another CM and is being resumed. The client will provide information on the last
messages that were acknowledged by each consumer, and the last sent messages or transactions that
were completed by each publisher. The client must restart incomplete transactions and resend
unconfirmed sent messages.

When a CM fails, all MM's that had been interacting with that CM will be notified by the group
membership protocol in the MessageBus. The MM must delete all consumer entries associated with the 
sessions on that CM so that it does not try to distribute messages to clients that are not reachable. These 
entries will be recreated when the client reconnects. The MM must also rollback any messages that 
were part of an uncommitted transaction of any sessions of the defunct CM. 

17 High Availability of the MM

The use of a multicast protocol to transmit data across the network is essential to the High Availability
scheme, as this permits data to be shared between a primary MM and all of its backups without wasting
network bandwidth. In order to conserve resources, one iBus multicast channel will be shared among all
of the Destinations in one MM. This makes it logical to make the MM the basic unit of fail-over, and not
the individual destinations. The design should allow multiple MM's to exist within one JVM, so that
fail-over can be used to selectively migrate part of the load from one machine to another.

The individual processes that are required to implement High Availability are described below: 

18 Designation of Startup Role 

For each logical MM, the cluster may contain one primary and any number of live backups. As each 
MM starts it must determine whether or not it is the primary. This can be explicitly specified, for 
example in a configuration file, however any fail-over scheme will cause a backup to become a primary 
if no other primary is found within a certain period of time. Likewise, if a backup becomes primary 
because the previous primary was temporarily isolated, there will be 2 primaries that must negotiate to 
determine which will be demoted or stopped. This means that the fail-over scheme and the order of 
node startup will ultimately determine the role of a new MM node. See the discussion of fail-over
below. 

19 Synchronization of a New Backup MM 

This scenario assumes that a primary and zero or more backups are already live. A new MM is started, 
determines that it is a backup, and must synchronize its state (directly or indirectly) with the primary.
Once it is synchronized, it can remain up to date by monitoring the multicast communication between
the primary MM and the CM's.

The discussion below uses these names to identify the three different parties that could take part in the 
synchronization. Depending on context they refer either to an MM, or one of the Destination Tasks on 
that MM. 

• Primary: The existing MM which is currently operating in primary mode 

• Host: The existing MM which is providing the state to the target MM. This is either the 
Primary or a backup that is already synchronized. 

• Target: the new backup MM which needs synchronization. 

The target MM begins collecting destination commands as soon as it comes online. These are passed to 
the Command Dispatcher, which accumulates them in a generic queue until its Destination Tasks are 
created. The target makes a request to locate the primary and any backup that can provide
synchronization. From the list it receives it selects one (preference should be given to a backup over the 
primary). Once negotiation is complete and the selected MM has agreed to be the synchronization host, 
the target requests a list of destinations from that host. The target creates these destinations, with 
command processing disabled, and registers them with the Command Dispatcher so that it can begin 
accumulating commands in the input queues dedicated to each destination. The following process is 
then executed for each destination, one at a time. It is necessary that all commands sent on the multicast
channels in the course of normal message processing contain the unique id of the session or destination
that sent them, and a sequence number. It is also necessary that the multicast protocol is atomic (either all
listeners receive each command or none do).

• Processing of incoming commands is suspended in both host and target destination. 
Commands continue to be accumulated in the incoming queues of both destinations during this 
time. 

• The host destination externalizes its state. This state includes all of the messages currently 
stored in the destination, plus a table containing the sequence number of the last command 
received from the primary destination and each session that has communicated with the 
destination. 

• The host destination may resume processing commands when the previous step is complete. 

• The externalized state is optionally compressed and then transmitted to the target via a point to 
point connection. 

• The target internalizes the state. 

• The target begins processing incoming commands, but must compare the sequence number of 
each command to those received from the host. 

• If the sequence number of a command from a session or the primary destination is less than or 
equal to the corresponding sequence number received from the synchronization host, the 
command is ignored. 

• If the sequence number of a command from a session or the primary destination is one greater
than the corresponding sequence number received from the synchronization host, the command
is processed and comparison of sequence numbers may be discontinued for this session or
primary.

• The arrival of a command from a session or the primary destination with a sequence number 
that is more than one greater than the corresponding sequence number received from the host 
represents an error condition that is not possible if the underlying transport medium is 
providing atomic ordered multicast. 
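
The three sequence number rules above can be sketched as a small Python filter. The class name SyncFilter and its method are hypothetical. One point is an assumption beyond what the text states: a command from a source that does not appear in the host's table (for example a session created after the state was externalized) is simply processed.

```python
class SyncFilter:
    """Decides which queued commands a newly synchronized target should process."""

    def __init__(self, host_sequence_table):
        # source id (session or primary destination) -> sequence number of the
        # last command the host had received when it externalized its state
        self._last_seen = dict(host_sequence_table)

    def should_process(self, source_id, seq):
        last = self._last_seen.get(source_id)
        if last is None:
            # Comparison was already discontinued for this source, or
            # (assumption) the source is unknown to the host's snapshot.
            return True
        if seq <= last:
            return False  # already reflected in the internalized state
        if seq == last + 1:
            del self._last_seen[source_id]  # caught up: stop comparing
            return True
        raise RuntimeError(
            "gap in sequence numbers from %s: expected %d, got %d"
            % (source_id, last + 1, seq)
        )
```

Once every entry has been removed from the table, the filter passes all commands and the target behaves like any other synchronized backup.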

20 Maintaining a Synchronized Backup 

Once a backup is in sync, it can process commands coming from sessions normally. It does not 
distribute messages however. It will also process incoming commands that were sent by the primary 
MM and update its state to remain in sync. 

If a backup MM detects that it has lost synchronization due to excessive message loss (possible if it has
been disconnected from the network and declared dead by the remaining members) it should change its
state to unsynchronized, and repeat the startup procedure.



21 MM Fail-over 

Fail-over is the process of promoting a backup MM to be primary when the original primary MM fails.
It consists of the following steps: 

• Recognizing that the primary has failed: The iBus//MessageBus Group Membership
services generate an event for all other channel members when one member leaves the channel
intentionally or unintentionally. The backup MM's will be notified when a node on their
channel fails, and they must read the application tag to see if the failed node was their primary.

• Designating the new primary (in the case of multiple backups): The backup MM's 
exchange their status with regard to synchronization. Of the up-to-date backup MM's the one 
with the lowest channel rank will become the new primary. 

• Switching the designated backup into primary mode: The backup must change state and 
begin processing as a primary. Message distribution is started. 
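
The designation step can be sketched in Python as below. The dictionary keys (node_id, rank, synchronized) are hypothetical names for the status that the backups exchange; every backup can apply the same rule to the same exchanged data and so arrive at the same choice.

```python
def choose_new_primary(backups):
    """Return the node id of the up-to-date backup with the lowest channel rank.

    backups: list of dicts with keys "node_id", "rank" and "synchronized".
    Returns None if no backup is up to date.
    """
    candidates = [b for b in backups if b["synchronized"]]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b["rank"])["node_id"]
```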



22 Multiple Primaries 

If the failure of the primary was due to a temporary network outage, the original primary could reappear 
at any time. One of the primaries must then revert to backup mode. The primaries compare their state 
by exchanging the set of sequence numbers from all commands published on the MM channel. This
gives them the chance to determine which primary is most up-to-date. The most up-to-date one remains;
any others revert to backup mode. If multiple primaries are fully up-to-date, then the one with the
lowest rank remains primary.

23 Network Partitioning

This is a catastrophic situation, in which a primary MM and all of its backups may become unreachable
at one time. In this situation normal processing cannot continue, but the cluster should insure that no 
rebroadcast storms result, and that normal processing can resume once the network is restored. 



24 Transaction Processing

JMS specifies the option of using transactions at the session level. This makes all actions that occur
between each commit command and the following one an atomic unit: they must either all succeed or 
all fail. The distributed aspect of the cluster gives rise to two major changes in the way that transactions 
are processed compared to a monolithic server. One is the coordination between transaction aspects 
handled by the CM and those handled by the MM. The other is the need for a 2 phase commit protocol 
within the cluster. 



25 Transaction Steps

• Produce Message: This occurs in a fashion similar to the non-transacted case. The producer
sends the message and continues processing without waiting for a reply from the server. The
CM passes the message to the appropriate MM, where it is stored marked as uncommitted. The
CM adds the message ID to the list of produced messages for the open transaction of the
corresponding session.

• Consume Message: The MM sends a message to the CM which forwards it to a consumer.
The CM adds the message ID to the list of consumed messages for the open transaction of the
corresponding session. The message continues to be stored in the MM where it is locked until
the MM receives either a commit (equivalent to an ACK) or a rollback.

• Commit: The list of produced and consumed message ID's for a session should be organized
by destination. The CM sends a COMMIT command containing the lists of produced and
consumed message ID's for all destinations. The list of consumed message ID's is that which is
provided by the client. The one stored in the session may contain messages that have not yet
been delivered to the consumer. If only one destination is involved, this may be a 1 phase
commit, and the CM may synchronously wait until the reply from that destination arrives. If 
more than one destination is involved then a 2 phase commit is needed. See below for more 
details. 

• Rollback: The CM sends a ROLLBACK command containing the lists of produced and 
consumed message ID's for that destination. The list of consumed message ID's stored in the
session is used, as the message store should be returned to the state it had at the beginning of 
the transaction. 

26 Two Phase Commit

A simple 2-phase commit protocol may be used to commit transactions across multiple destinations. 
The requirements of JMS transactions are less demanding than those of many other transactional
systems. Transactions occurring in different sessions have no interdependencies and since one producer
may not produce in more than one session, JMS sets no restrictions on the relative ordering of messages 
from different transactions. 

The CM, which handles the session that is conducting the transaction, acts as the transaction manager. 
The steps of a 2-phase commit are: 

• A COMMIT_PREPARE command is sent to all MM's. It lists all of the destinations
involved in the transaction and the id's of the consumed and produced messages per
destination, as well as a unique transaction ID.

• The Destination Command Distributer distributes copies of the command to each destination
that is involved in the transaction. 

• Each destination checks that all produced messages for which it is responsible are available in 
the message store and have uncommitted state. It checks that all consumed messages for which 
it is responsible are in the message store and are locked by the session of the transaction. If so, 
it sends a reply containing COMMIT_READY and a list of destinations. Otherwise it sends a
COMMIT_FAIL message. If the MM has no destinations involved in the transaction, then it
sends a COMMIT_READY message containing no destinations.

• If the CM receives COMMIT_READY from all involved MM's, then it sends a
COMMIT_FINAL message to the transaction channel, containing the transaction ID.

• The Commit Manager in each MM forwards the COMMIT_FINAL message to each
destination involved. Each destination changes the state of the committed messages and returns
COMMIT_COMPLETE. If the MM has no destinations involved in the transaction, then it
sends a COMMIT_COMPLETE directly.

• After all COMMIT_COMPLETE messages have been received, the CM returns a success
message to the client.

• If the CM receives one or more COMMIT_FAIL messages in response to the
COMMIT_PREPARE, or one or more of the destinations times out, then it sends
COMMIT_ROLLBACK messages to all involved destinations and notifies the client of failure.
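
The steps above can be sketched as a small in-memory Python model. It is a simplified illustration, not the protocol implementation: commands are plain method calls rather than multicast messages, time-outs and the transaction ID are omitted, and the class MessageManager, the function commit_transaction and the message state values are hypothetical.

```python
class MessageManager:
    """Holds one message store per destination that this MM is responsible for."""

    def __init__(self, destinations):
        # destination name -> {message id: state}; a state is "uncommitted",
        # "available", or ("locked", session_id)
        self.stores = {name: {} for name in destinations}

    def _involved(self, txn):
        return [(self.stores[d], produced, consumed)
                for d, (produced, consumed) in txn.items() if d in self.stores]

    def commit_prepare(self, session_id, txn):
        for store, produced, consumed in self._involved(txn):
            if any(store.get(m) != "uncommitted" for m in produced):
                return "COMMIT_FAIL"
            if any(store.get(m) != ("locked", session_id) for m in consumed):
                return "COMMIT_FAIL"
        return "COMMIT_READY"  # also returned when no destination is involved

    def commit_final(self, txn):
        for store, produced, consumed in self._involved(txn):
            for m in produced:
                store[m] = "available"   # now visible to consumers
            for m in consumed:
                del store[m]             # commit acts as the acknowledgement
        return "COMMIT_COMPLETE"

    def commit_rollback(self, session_id, txn):
        for store, produced, consumed in self._involved(txn):
            for m in produced:
                store.pop(m, None)       # discard uncommitted messages
            for m in consumed:
                if store.get(m) == ("locked", session_id):
                    store[m] = "available"  # unlock for redelivery


def commit_transaction(session_id, txn, message_managers):
    """CM side. txn maps destination name -> (produced ids, consumed ids)."""
    replies = [mm.commit_prepare(session_id, txn) for mm in message_managers]
    if all(reply == "COMMIT_READY" for reply in replies):
        results = [mm.commit_final(txn) for mm in message_managers]
        return all(result == "COMMIT_COMPLETE" for result in results)
    for mm in message_managers:
        mm.commit_rollback(session_id, txn)
    return False
```

The return value of commit_transaction corresponds to the success or failure notification that the CM sends to the client.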

27 The Role of Backup Processes in Two Phase Commits

There are several options for the role that backup MM's can play in the commit process. They range from
the optimistic extreme of not including the backup in the commit procedure, to the conservative 
extreme of failing a commit if any backup fails. 

The conservative route incorporates a high risk of causing performance problems. It means that any
backup MM, which is not contributing to the function of a normally running system, can delay
a transaction or cause it to fail if it is not functioning properly. This would mean that the
increased redundancy that comes from multiple backups can detract from system performance and
possibly make the system less reliable than a monolithic server.

The optimistic route implies that an unlikely failure scenario could lead to message loss. When a JMS
client successfully returns from a commit, that commit would be guaranteed successful on the primary,
but not on the backup. Atomic multicast guarantees that the backup will receive all commands in the
event of primary failure, as long as there is at least one surviving channel member that had received all
commands. This means that the backup will eventually receive all commands. Thus, in the scenario that
a primary commits a transaction and then fails, it is very likely that the backups receive all commands,
but a resource problem on the backup, such as a full disk, could still lead to message loss.

The optimum solution is to require some, but not all, of the redundant MM's to succeed with the
commit. This means that the primary plus at least one of the backups must commit for the commit to be
effective. The fail-over protocol will insure that only an up-to-date backup (one that has processed all 
transactions) is allowed to become primary. 



28 Wildcard Subscriptions

Users of a messaging API very often find this feature helpful, and sometimes even essential. The use of 
this technique can eliminate the need to make individual subscriptions to hundreds of individual topics, 
and can insure that new topics that match the subscription criteria will automatically be subscribed to on 
behalf of the client. 

Wildcarding can be implemented mostly in the CM. The consumer must send a subscription request to 
the CM that contains a wildcard string. The string can use a 'glob' style wildcard pattern (*, ?) or a
regular expression (for power users - there needs to be an indication of which method is being used).
The CM is not expected to maintain a list of all destinations in existence, just those with which it
currently interacts. The CM must 'broadcast' a message to all MM's requesting all destinations that 
match the pattern. This is a variation of the basic command required for a CM to locate a destination. 
The CM then generates normal subscriptions to all of the destinations returned. 

The wildcard functionality includes the ability to automatically merge in new destinations that are 
created after the original subscription was made, if their names match the subscription pattern. This
means that each time a new destination is created, it must advertise itself to all of the CM's so that
they can compare its name to their list of wildcard subscriptions.
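
A minimal Python sketch of the CM-side matching follows. It relies on the standard fnmatch and re modules; the class names, the style argument and the known_destinations parameter (which stands in for the replies to the broadcast sent to the MM's) are hypothetical.

```python
import fnmatch
import re


class WildcardSubscription:
    """A subscription pattern in either 'glob' or 'regex' style."""

    def __init__(self, pattern, style="glob"):
        if style == "glob":
            # fnmatch.translate converts a glob pattern into a regular expression.
            self._regex = re.compile(fnmatch.translate(pattern))
        elif style == "regex":
            self._regex = re.compile(pattern)
        else:
            raise ValueError("unknown pattern style: %r" % style)

    def matches(self, destination_name):
        return self._regex.fullmatch(destination_name) is not None


class WildcardRegistry:
    """Tracks wildcard subscriptions and the normal subscriptions derived from them."""

    def __init__(self):
        self.wildcards = []        # (consumer id, WildcardSubscription)
        self.subscriptions = set() # (consumer id, destination name)

    def subscribe(self, consumer_id, subscription, known_destinations):
        self.wildcards.append((consumer_id, subscription))
        for name in known_destinations:
            if subscription.matches(name):
                self.subscriptions.add((consumer_id, name))

    def on_destination_created(self, name):
        # Called when a newly created destination advertises itself.
        for consumer_id, subscription in self.wildcards:
            if subscription.matches(name):
                self.subscriptions.add((consumer_id, name))
```

Keeping the pattern style explicit in the subscription addresses the need for an indication of which matching method is being used.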

29 Well Known Destinations

Other parts of this document state the need for special 'well known' destinations. These are often
ordinary destinations that have reserved names and/or ID's and which must always exist for
administrative purposes. The names begin with underscore characters. To avoid name conflicts, clients
will not be permitted to create destinations with names that begin with underscores. Well known
destinations will usually have special access control restrictions. The well known destinations include:

• _ADMIN - The topic for administration commands 

• _LOG - The topic for storing and distributing log messages

• _DMQ - The Dead Message Queue

30 Administration and Configuration

These two areas are closely related, as online administration commands can override values specified in 
configuration files. Often such values must be updated in the configuration files so that changes made 
online persist after node restart. Many aspects of administration affect multiple cluster nodes 
simultaneously, which adds an extra degree of complexity compared to the case of a monolithic server. 
It is necessary to insure that updates that affect multiple nodes are carried out on all affected nodes, and 
that these changes are updated in their configuration files in a synchronized manner. The case of nodes
which join the cluster late or are not live when updates are made must also be considered.

31 Administration 

The administration subsystem is a generalized framework for remote server administration. Any other 
subsystem in the server may register a handler with it and thereby expose its own set of administration 
commands to the administration client. The nature of administration commands is such that some 
commands are relevant only to an individual node, some are relevant to a subset of nodes, and some are
relevant to all nodes. Some examples are:

• One node: browse a queue (MM primary) 

• Subset of nodes: delete a message (MM primary and backups), update user list (all CM's), get
list of destinations (all MM's) 

• All nodes: get status 

Some commands require a reply, which, in the distributed case, is actually a composite of the replies 
from many nodes; for example "get list of destinations". 

Administration of the cluster is achieved by adding an Admin Manager to each cluster node. This 
Admin Manager will act like a session that is not associated with a client. When it is created it will
create a consumer for the special topic _ADMIN, and await administration commands on this topic. Since
access control can be defined per destination, the _ADMIN topic may be a normal JMS topic. The 
Admin Managers will be 'internal clients' within the cluster. An administrative client application is an 
ordinary JMS client, and the lowest level of the client Admin API is the definition of a set of message 
formats. 

The sending of replies to the admin client can be handled by specifying a replyTo topic with each
command that the client sends. The difficulty with receiving replies is that a JMS client cannot know 
how many nodes are active in the cluster, and thus not know how many replies to expect. Waiting for a 
time-out after each command is not practical. Either administration clients must be designed to function 
well despite an unknown number of asynchronous replies, or the replies must contain some cluster 
internal information indicating the total number of replies to expect. The former is not an attractive
option, since the admin API will be available to customers and the semantics should be kept simple. 
The latter is possible, but the cluster design does not explicitly require most subsystems to know the 
overall structure of the cluster. Nevertheless, this information can be made available to the Admin 
Manager. 
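
The second option, in which each reply states how many replies to expect, can be sketched in Python as follows. The reply fields (node, expected_replies, body) and the class name ReplyCollector are assumptions for illustration; the document does not define the reply format.

```python
class ReplyCollector:
    """Collects the replies to one admin command until all have arrived."""

    def __init__(self):
        self.replies = []
        self.expected = None  # unknown until the first reply arrives

    def add(self, reply):
        # Each reply carries the total number of replies the client should expect.
        self.expected = reply["expected_replies"]
        self.replies.append(reply)
        return self.complete

    @property
    def complete(self):
        return self.expected is not None and len(self.replies) >= self.expected

    def merged_body(self):
        # Example composite result for a command such as "get list of destinations".
        return sorted({item for reply in self.replies for item in reply["body"]})
```

With this approach the admin client needs no time-out in the normal case and no knowledge of the cluster structure; it waits until complete is true and then builds the composite reply.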

The Admin Manager will act like a JMS client that lives inside the cluster. In this way, it can leverage 
the existing messaging infrastructure to communicate with the admin client. In the CM, the Admin
Manager can be implemented as a special subclass of a Session Task which is automatically created
during node initialization and which is not associated with any client connection. Integrating the Admin
Manager into the MM is a bit more complex, since MM's do not automatically listen for
commands from other MM's. In this case an extra element is needed: an Admin Dispatcher that will
listen for commands on the channel of the _ADMIN topic, and pass them to the input queue of the
Admin Manager.



32 Configuration

Configuration data is generally required in the early stages of starting a node. For this reason it is a good
idea to use a local file to store configuration data, which insures that a node can always (re)start and
integrate itself into the cluster without depending on the operational state of any other server. The
configuration system used in the cluster must recognize online changes made to configuration 
parameters via the admin API and update the configuration file to reflect these changes. Additionally, it 
must insure that the updates remain consistent across all nodes. 

For our purposes, all data files required by nodes (configuration, users, ACL's, etc.) will be considered 
configuration files. Let us also divide the parameters into two categories: 

1. Essential Parameters: those that are essential in order for a node to start, contact other nodes,
and initialize the Admin Manager 

2. Acquirable Parameters: those that could be acquired from other nodes after the steps above are 
complete 

33 Handling Essential Parameters 

Parameters in each category should be stored in separate files. For the essential parameters, the files
should be identical for all nodes, and should not be updated online by the Admin Manager. An
administrative procedure should be in place to ensure that all nodes have identical copies of this file. An
example of this is editing only a master copy and using UNIX rdist or a similar utility to push it to the
individual nodes. Storing the file on a central network file system is not an acceptable option as this
introduces a single point of failure.

The cluster can support immediate detection of inconsistent configuration files using the following
procedure:

• When a node is initialized, it creates a digest of the configuration file that it read. This may be 
a simple checksum calculation. 

• The node requests the corresponding digest from all other nodes. 

• If the node's own digest and all those received in response to the request are not all identical,
then an error message is generated and node startup fails.
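The startup check described above can be illustrated with a minimal sketch. It assumes that the digests of the other nodes have already been collected by some means; the function names are hypothetical, and SHA-256 is used here purely for illustration where the text allows a simple checksum.

```python
import hashlib
from typing import Sequence


def config_digest(config_bytes: bytes) -> str:
    """Return a hex digest of the configuration file contents that were read."""
    # Assumption: SHA-256 stands in for the "simple checksum" mentioned in the text.
    return hashlib.sha256(config_bytes).hexdigest()


def check_startup_consistency(own_digest: str, peer_digests: Sequence[str]) -> None:
    """Fail node startup if any other node reports a different digest."""
    mismatched = [d for d in peer_digests if d != own_digest]
    if mismatched:
        raise RuntimeError(
            "essential configuration differs on %d node(s); startup aborted"
            % len(mismatched)
        )
```

A node that is the first to start receives no digests from other nodes, so the check passes trivially in that case.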

34 Handling Acquirable Parameters 

In order to assure consistency across nodes, acquirable parameters should be updated either: 

• off-line by editing the configuration files, when no nodes are active

• online by issuing commands from an admin client when one or more nodes are active

When online modifications are made, nodes that are not online at the time will have 'stale' configuration
files. During initialization, a node should perform a consistency check similar to that described above
for essential parameters. In the case that the node detects that its configuration file is stale, it requests
that the entire configuration file be sent to it from another node which was already online before the
node in question started. It then uses this file for configuration, and rewrites its local configuration
files.
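The handling of a stale file can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function name, the two callbacks (one that fetches the file from a node that was already online, one that rewrites the local file) and the use of SHA-256 as the digest are all hypothetical.

```python
import hashlib
from typing import Callable, Optional


def digest(config_bytes: bytes) -> str:
    """Digest used for the consistency check (SHA-256 is an assumption)."""
    return hashlib.sha256(config_bytes).hexdigest()


def load_acquirable_config(
    local_config: bytes,
    peer_digest: Optional[str],
    fetch_from_peer: Callable[[], bytes],
    rewrite_local: Callable[[bytes], None],
) -> bytes:
    """Return the configuration to use, refreshing the local copy if it is stale."""
    if peer_digest is None:
        # No other node is online, so there is nothing to compare against.
        return local_config
    if digest(local_config) == peer_digest:
        return local_config
    # The local file is stale: take the whole file from the online node
    # and rewrite the local copy so the next restart begins consistent.
    fresh = fetch_from_peer()
    rewrite_local(fresh)
    return fresh
```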

This procedure works if files are not permitted to be edited off-line when the cluster is active. If off-line 
editing were allowed, then inconsistencies could be detected, but it could be difficult to determine 
which file version is more correct. Using the last modified date of the file is not sufficient, because both 
files may have changed since they were last in sync, and a merge may be necessary. The use of a 
revision control system could allow such merging to be done automatically, but would introduce a 
single point of failure. The most robust solution is to rely on a certain degree of administrator discipline 
and disallow manual file updates when the cluster is running. 



35 Event Logging

A log API will be available in each node so that information about important events or error conditions
can be recorded. As a minimum, this information should be written to a local file for each node. It is,
however, difficult to track events that involve different nodes when each node's log data is in a different
file. Because of this, log messages are to be published to a well-known topic (JLOG), so that they are
stored in a fault tolerant manner and can be monitored using a JMS client. All nodes should have well 
synchronized clocks, so that the consolidated log messages can be accurately ordered. (Other aspects of 
the cluster should not be highly sensitive to clock synchronization.) There should be a log dispatcher 
inside each node, which writes log messages to a thread safe queue for processing in another thread. 
This allows priority threads to avoid the high overhead I/O operations associated with logging. 
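The log dispatcher can be sketched as a thread-safe queue drained by a separate worker thread. The class and method names below are hypothetical; the `sink` callback stands in for whatever performs the slow I/O (writing the local file and publishing to the log topic).

```python
import queue
import threading
import time
from typing import Callable, Tuple

LogRecord = Tuple[float, str, str]  # (timestamp, node id, text)


class LogDispatcher:
    """Accepts log records cheaply and hands them to a slow sink on another thread."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, sink: Callable[[LogRecord], None]) -> None:
        self._queue: "queue.Queue" = queue.Queue()
        self._sink = sink
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, node_id: str, text: str) -> None:
        # Only enqueues, so a priority thread never waits on file or network I/O.
        self._queue.put((time.time(), node_id, text))

    def close(self) -> None:
        # Records queued before the sentinel are processed before the worker exits.
        self._queue.put(self._STOP)
        self._worker.join()

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(item)
```

The timestamp is taken when the event is logged rather than when it is written, which is what allows consolidated messages from well synchronized nodes to be ordered.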

36 Dead Message Queue (DMQ)

The Dead Message Queue is a well-known queue that is used as the destination of messages for which 
the ultimate disposition is not clear. These may be messages that have remained undeliverable for an 
excessively long period of time, or are undeliverable due to error conditions. It may also be desirable
for messages that exceed their normal time to live to be sent here. The DMQ behaves like a normal queue,
except for the fact that clients are restricted from publishing to it directly. The messages contained in
the DMQ should indicate their original destination and the reason for being sent there.
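A minimal sketch of these DMQ rules follows. All names are hypothetical: `divert` represents the internal path by which the server places a message in the DMQ, while `publish` represents the client-facing operation that is refused.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class DeadMessage:
    original_destination: str  # where the message was originally addressed
    reason: str                # why it was sent to the DMQ
    body: bytes


class DeadMessageQueue:
    def __init__(self) -> None:
        self._entries: Deque[DeadMessage] = deque()

    def publish(self, body: bytes) -> None:
        """Client-facing publish: always refused for the DMQ."""
        raise PermissionError("clients may not publish to the DMQ directly")

    def divert(self, original_destination: str, reason: str, body: bytes) -> None:
        """Server-internal path that records the origin and reason with the message."""
        self._entries.append(DeadMessage(original_destination, reason, body))

    def receive(self) -> Optional[DeadMessage]:
        """Consume like a normal queue: oldest entry first, None when empty."""
        return self._entries.popleft() if self._entries else None
```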

37 Glossary of Terms Used

• Cluster: A group of processes that run on more than one computer and work together to act like
a single message server, but with increased performance and reliability.

• Node: A single logical process within a cluster. Often a node will correspond to a single
computer, but this is not strictly the case. Multiple nodes sharing a computer will interact with
each other as though they are on different computers connected only by a network.

• Monolithic Server: A complete message server running as a single node. To a client, a cluster
is functionally equivalent to a monolithic server.

• Server Instance: Generic term for a single logical message server. This can be a monolithic
server or cluster as defined above.

• Client: An application program that uses the JMS API to send messages to, or consume 
messages from, a server instance. 

• Linear Scaleability: Relationship between some system capability (performance, storage 
capacity, etc.) and some system resource in which an increase in the amount of resource 
available causes a proportional increase in the system capability. In the case of linear 
scaleability, a plot of capability vs. resource results in a straight line. 

• JMS: (Java Message Service) A standard application programming interface (API) for
programs written in the Java language to use for accessing the services of a message system. 



38 We claim:

These are samples, the real claims need to be written by a patent attorney. 

1. A message system for delivering data in the form of discrete asynchronous messages between
message clients, consisting of a server cluster with at least two nodes, where at least one node has the
function of managing client connections and at least one node has the function of storing and
distributing messages, and at least one client employing a library written in the Java language and
conforming to the Java Message Service API.

2. A message system according to claim 1 where an increase in the number of clients and an increase
in the number of nodes of both types effects an increase in the total amount of message data that can be
passed through the system in a given amount of time.

3. A message system according to claim 1 where the number of cluster nodes dedicated to each
of the two functions may be varied independently.

4. A message system according to claim 1 where an increase in the number of cluster nodes
dedicated to managing client connections effects an increase in the number of simultaneous
connections that may be supported.

5. A message system according to claim 1 where an increase in the number of cluster nodes
dedicated to storing and distributing messages effects an increase in the total effective storage
capacity of the system.

6. A message system according to claim 1 where the failure of any single node in the cluster will not 
render the entire system inoperable. 

7. A message system according to claim 1 in which not all possible pairs of nodes in the server cluster 
are required to exchange data directly. 

8. A message system according to claim 1, in which a reliable multicast communications protocol is 
used for inter-node data transfer to maintain 1 or more identical, redundant copies of stored data 
from the same data transfer that maintains the original copy of stored data. 

9. A message system according to claim 1 that supports the delivery of arbitrarily large messages.

39 Abstract of the Disclosure

The invention described in this document is a highly scaleable messaging system featuring a message
server cluster. It belongs to a class of messaging system that allows different client applications to send
discrete messages to each other asynchronously and independently of whether the receiving client is
available at the time a message is sent. The system guarantees delivery of the message by storing it until
the receiver is ready to consume it. This messaging system is designed to accommodate the most
demanding needs by providing fault tolerance and a very high degree of scalability of system 
capabilities by increasing the number of computers used in the cluster. Specifically the scalability of the 
following system aspects is addressed: performance, storage capacity, connection capacity, data 
redundancy, and message size.

40 Drawings




Drawing 1: Message System Diagram

Drawing 2: Alternate Network Topologies

Drawing 3: Client Manager Architecture

Drawing 4: Message Manager Architecture

the cluster can increase the amount of data that can be delivered within a time period, or the speed
at which an individual message can be delivered to its destinations.

• Scaleability with respect to connections: Each active connection to the cluster consumes a certain
amount of system resources, placing a limit on the number of connections that can be active at one
time, even if these connections are not used to transfer significant amounts of data. This describes
the degree to which adding additional computers to the cluster increases the number of
simultaneous active connections that are possible.

• Scaleability with respect to redundancy: This is the degree to which adding additional computers to
the cluster can increase the redundancy, and therefore the reliability of the cluster, especially with
regard to data storage. If each piece of data is copied onto two different computers, then any one
computer can fail without causing data loss. If each piece of data is copied onto three different
computers, then any two computers can fail without causing data loss, and so on.

• Scaleability with respect to message storage: This is the ability to increase the total storage capacity of
the cluster by adding more machines. A clustering scheme that requires all computers in the cluster 
to store all messages cannot scale its storage capacity beyond the storage capacity of the least 
capable computer in the cluster. 

• Scaleability with respect to message size: This concerns the maximum limit on the size of a single
message. Unlike the other aspects of scaleability, this is not related to the number of computers in
the cluster. Conventional message server solutions cause the maximum message size to be
determined by the amount of working memory (RAM) available in the computers that handle the
message, when other aspects of the implementation do not limit it to be even less than that. This
invention alleviates this restriction and allows maximum message size to be limited only by the
amount of mass storage (hard disk capacity) available on each computer.
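The redundancy and storage arguments in the list above reduce to simple arithmetic, sketched here. The function names are hypothetical, and the partitioned figure is an idealised upper bound that assumes data is spread perfectly evenly, which the text does not guarantee.

```python
from typing import Sequence


def tolerated_failures(copies: int) -> int:
    """With each piece of data on `copies` computers, copies - 1 may fail without loss."""
    return copies - 1


def capacity_when_all_store_all(node_capacities: Sequence[float]) -> float:
    """If every computer stores every message, the least capable computer sets the limit."""
    return min(node_capacities)


def capacity_when_partitioned(node_capacities: Sequence[float], copies: int) -> float:
    """Idealised bound when each message is stored on only `copies` computers."""
    # Assumption: perfectly even distribution of data over the computers.
    return sum(node_capacities) / copies
```

For example, three computers with capacities 100, 40 and 80 can hold at most 40 units in total if all of them must store all messages, regardless of how many more computers are added.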



Conventional messaging cluster implementations are extensions of servers architected to run on a 
single computer. Each computer in the cluster is a complete server, with extensions that allow it to 
work together with other servers in the cluster. In order to ensure that all messages are available to all
potential receivers, all servers in the cluster must share information about the existence of messages
and/or the existence of receivers with all other servers in the cluster. The current state of the art in
reliable network communications is unicast (point-to-point) network connections. The use of unicast to
exchange data between all possible pairs of computers in the cluster results in inefficient usage of the
communications network that severely limits scaleability. In a cluster of N servers, each piece of
information that a server must share with all other servers in the cluster must be sent N-1 times across
the same communication network. This means that adding additional servers to the cluster causes more 
communications network capacity to be used, even when the actual data rate does not change. This 
does not scale well, since adding large numbers of servers to a cluster will cause the communication 
network to become saturated, even with small numbers of senders and receivers, and low message 

In contrast, the invention described here assigns different functions to different computers in the
cluster. The programs running on each individual computer cannot, and need not, operate as a complete
server. This actually eliminates the need for all computers in the cluster to communicate with all other
computers in the cluster. Additionally, a reliable multicast (point to multipoint) protocol is employed to
further reduce the need for identical data to be sent multiple times across the same communications
network.
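The difference between the two approaches can be made concrete with a small calculation. This is an idealised sketch with hypothetical function names; it ignores the acknowledgement and retransmission traffic that a reliable multicast protocol adds.

```python
def unicast_sends(servers: int) -> int:
    """Times one piece of shared information crosses the network using unicast."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers - 1


def multicast_sends(servers: int) -> int:
    """Idealised multicast: one transmission reaches every other server."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return 1 if servers > 1 else 0


def network_load(updates_per_second: float, sends_per_update: int) -> float:
    """Transmissions per second placed on the shared network."""
    return updates_per_second * sends_per_update
```

At a constant rate of 100 shared updates per second, a 20-server unicast cluster places 1900 transmissions per second on the network, while the idealised multicast figure stays at 100 however many servers are added.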



This invention is specifically designed to accommodate programs that send and receive messages using
the Java Message Service (JMS) application programming interface published by Sun Microsystems
Inc. The definition of this interface is available at http://java.sun.com/products/jms/docs.html.



2 Brief Summary of the Invention 

The invention described in this document is a highly scaleable clustered message server (the cluster).
The invention uses a unique cluster design to achieve a higher degree of scaleability than has been
previously possible with this type of server. The cluster is designed to scale well with respect to
number of connections, message volume, and reliability. This means that the capacity of the cluster in 
each of these areas will increase as more machines are added to the cluster. In addition it is designed to 
be scaleable with respect to message size, in that it will not fail to operate with messages of arbitrarily 
large size. 



The structure of a complete message system is shown in Drawing 1: Message System Diagram. The
cluster is comprised of individual machines called nodes. Each node runs a program that constitutes
one part of the cluster. There are two types of node: Message Manager (MM) and Client Manager
(CM). A cluster consists of one or more CM's and one or more MM's. The CM's are responsible for
managing client connections. The MM's are responsible for storing messages.



The invention is comprised of the computer software that performs the functions of the CM and the
MM, as well as the part of the software in the client that implements the JMS compatible messaging
library. The following things are not part of the invention and can vary across different deployments of 
the invention: 

• Computer hardware on which the software runs

• The number of computers used in the cluster and allocation of CM nodes, primary MM nodes and
backup MM nodes among those computers.

• The type and configuration of the network that interconnects the nodes in the cluster

• The type and configuration of the network(s) that connects clients to CM nodes

• The client application that interacts with the JMS compatible message library

• The means for determining to which CM a client should connect in order to balance the load among
all CM nodes. (There is a large variety of existing hardware and software solutions for this which
are appropriate.)

In order to connect to the cluster, a client must connect to one of the CM's. All of the CM's in a cluster are
interchangeable. A client will get the exact same service from the cluster, regardless of which CM it
connects to. The CM is responsible for managing client connections, client authentication, access
control, forwarding messages from producer clients to the MM and forwarding messages from the MM
to a consuming client. As stated above, all of the CM's are interchangeable, and additional CM's can be
added to increase the total number of clients that can be served by the cluster. If a CM fails, the clients
that were previously connected to that CM may reconnect to another CM and continue functioning
without any loss of service.
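Because the CM's are interchangeable, reconnection after a CM failure can be as simple as trying the known CM's in turn. The sketch below is an illustration only: the function name and the `connect` callback are hypothetical, and the choice of which CM to try first (load balancing) is outside the invention, as noted above.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def connect_to_any_cm(cm_addresses: Iterable[str], connect: Callable[[str], T]) -> T:
    """Return a connection to the first CM that accepts it."""
    last_error: Optional[Exception] = None
    for address in cm_addresses:
        try:
            return connect(address)
        except ConnectionError as error:
            # This CM is unavailable; any other CM offers the same service.
            last_error = error
    raise ConnectionError("no Client Manager could be reached") from last_error
```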




Messages are stored in a destination until they are consumed. The destination can be a queue or a topic,
depending on the actual service desired. These terms are defined in the JMS specification. Each
destination exists on one or more MM's. When a destination exists on more than one MM, one of them
is designated as the primary and is responsible for providing all of the services of the destination. All
other MM's containing that destination are backups, which maintain the same state as the primary, but
do not provide any services unless the primary fails to function. Increasing the number of MM's
increases the capacity of the cluster to store messages and increases the number of destinations that can
be accommodated. Increasing the number of MM's also permits an increase in the number of backup
MM's, which decreases the likelihood of losing data if multiple nodes fail simultaneously.
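The relationship between a destination, its primary and its backups can be sketched as below. The class is hypothetical, and the rule that the first listed backup takes over is an assumption made for illustration; this passage does not specify how a new primary is chosen.

```python
from typing import List


class Destination:
    """A queue or topic held on one primary MM and zero or more backup MM's."""

    def __init__(self, name: str, mm_nodes: List[str]) -> None:
        if not mm_nodes:
            raise ValueError("a destination must exist on at least one MM")
        self.name = name
        self._replicas = list(mm_nodes)  # index 0 is the primary

    @property
    def primary(self) -> str:
        """The only MM that provides the services of this destination."""
        return self._replicas[0]

    @property
    def backups(self) -> List[str]:
        """MM's that hold the same state but provide no services."""
        return self._replicas[1:]

    def node_failed(self, node: str) -> None:
        """Drop a failed MM; if it was the primary, the next replica takes over."""
        if node in self._replicas:
            self._replicas.remove(node)
        if not self._replicas:
            raise RuntimeError("all MM's holding %r have failed" % self.name)
```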



In order to assure that all clients can send messages to, and receive from, all destinations, it is necessary
that all CM's can communicate with all MM's, and vice versa. It is not necessary for CM's to directly
communicate with other CM's. It is not necessary for MM's to communicate with each other directly,
except for communication between primaries and their corresponding backups. This reduces the
number of connections that must be maintained between nodes by half, compared to traditional cluster
designs that require all nodes to be connected to each other. As discussed below, the use of multicast
communication removes the need for point to point connections between nodes entirely. Despite this,
the fact that not all pairs of nodes require direct communications still provides benefit because it allows
a lot of freedom in creating partitioned network topologies that prevent network communication from
becoming the bottleneck that limits the performance of the cluster. (See Drawing 2: Alternate Network
Topologies)
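The reduction in node pairs can be checked with a short calculation. The function names are hypothetical, and the sketch ignores the links between primaries and their backups; under that simplification the figure of one half is approached when the numbers of CM's and MM's are equal and large.

```python
def full_mesh_pairs(nodes: int) -> int:
    """Pairs of nodes that must communicate when every node talks to every other."""
    return nodes * (nodes - 1) // 2


def cm_mm_pairs(cms: int, mms: int) -> int:
    """Pairs that must communicate when only CM-to-MM communication is required."""
    return cms * mms
```

With 5 CM's and 5 MM's this gives 25 pairs instead of 45; with 50 of each it gives 2500 instead of 4950.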

The transfer of data between CM's and MM's is achieved using a reliable multicast protocol. Multicast
protocols are different than unicast (point to point communication) protocols in that they enable one
piece of data to be distributed to multiple machines across a network without having to send that same
data over the same network multiple times. Multicast is different than broadcast protocols in that it does not
require the data to be distributed to all computers on the local network. Multicast is the most efficient
means of distributing identical data to a limited number of computers on the same local area network.
The preferred embodiment of this invention uses the reliable multicast communication protocol
provided by the product iBus//MessageBus from Softwired.

Since data is distributed via multicast, the primary and backup MM's can receive the same data without
incurring significantly more network traffic than there would be if no backups were present. This
means that the cluster can have as many backups as desired, resulting in no limit on the scaleability of
storage redundancy. The cluster does not, however, require that all machines store all messages, which
would limit the scaleability of cluster storage capacity.

The unique aspect of this invention is its ability to provide the function of a single logical message
server, while providing a high degree of scaleability in all of the following respects:

• Scaleability with respect to performance: Load balancing permits performance to scale as the number
of nodes is increased. Different clients that connect to different CM's and exchange messages over
different destinations need not access the same nodes at the same time, thus all operations done by
the cluster on behalf of these clients may execute in parallel. Limits are imposed when many clients
compete for resources of the same CM or the same MM (too much load on one destination) as well
as by the data network that interconnects the cluster. When the cluster is deployed with client
applications that distribute load evenly over many destinations, client connection logic that
distributes clients evenly over CM's, and network topologies that permit maximal parallel data
transfer between CM's and MM's, then there is no fixed limit in performance.

• Scaleability with respect to connections: The number of connections that may be maintained scales
linearly with the number of CM's. This means that if each CM can handle n connections, then m
CM's can handle m x n connections. The number of CM nodes may be increased independently of
the number of MM nodes.

• Scaleability with respect to redundancy: The use of multicast data communication allows backup
nodes to maintain data synchronization with their primary node without adding load to the primary or
consuming additional network bandwidth. This means that a cluster may be deployed with as many
redundant backups as desired, without a significant impact on cluster performance.

• Scaleability with respect to message storage: On a single node, message storage is limited by the
amount of mass storage (hard disk space) that can be attached to that node, as well as the speed at
which data can be transferred to and from that mass storage. This cluster design does not require all
MM nodes to store all data. Each primary MM stores different data, and the total amount of storage
capacity scales linearly with the number of MM nodes, assuming all MM nodes have the same
storage capacity and the client application is effective in distributing load evenly across
destinations.

• Scaleability with respect to message size: Message size is unrelated to the number of nodes in the
cluster, but avoiding a fixed limit on the maximum size is also an important scalability issue. This
cluster design allows clients to send messages that are located only in mass storage. The message is
read from mass storage in chunks, with each chunk being sent to a CM and forwarded to an MM
where it is placed back into mass storage. The first chunks of the message may be written to mass
storage in the MM before the last ones are read from mass storage in the client. Transfer of